The present invention relates generally to video signal processing. More
particularly, the invention relates to a video indexing and image retrieval system.

Over the last few years, there has been a dramatic increase in available
bandwidth over high-speed networks. At the same time, computer manufacturers
have improved the storage capacities of hard drives on personal computers, and
improved the speed of the system bus and the motherboards that access the hard
drives. The quality and efficiency of data compression algorithms have likewise
improved information transmission efficiency and access rates, particularly with
respect to video data.

One of the most important tasks of a database manager is to provide easy and
intuitive access to data. This task can be particularly difficult when a user would like
to search for images or other visual data such as video segments. Browsing and
searching for data is one useful way that the Internet allows users to access related
Internet pages rapidly and intuitively through text-based searches.

To allow the searching of visual data, an image retrieval system must be able to
assess the similarity of a query image with images or frames of video stored in a
database. There are several ways that a user may provide a query image. For
example, users may have a rough idea of an image that they are looking for. The user
may draw a simple sketch of the image by hand and upload the sketch with a scanner,
or may create the sketch with drawing software. Alternatively, a photo of the image
or of a similar image can be used to find other similar images in the database.



An image search engine must be able to generate a measurement of the
similarity between the query image and database images so that the user is presented
with a list of database images ordered from the most relevant to the least relevant.
The image search engine associated with the image retrieval system must be able to
look for similarities between significant features of the query sketch or image and the
database images while ignoring minor detail variations. In other words, the image
search engine must measure the visual similarity between the query image and the
database images invariantly.

When searching video sequences, it would be inefficient for the image
retrieval system to compare the query image to every frame of the video sequence. A
video sequence typically contains one or more shots. A shot is a sequence of related
frames that are taken by one camera without interruption. To avoid the inefficiency,
the image database manager must take the time to segment the shots and identify a key
frame to represent each shot. To simplify this problem, it is desirable to perform video
segmentation and key frame identification automatically.

A first step towards automatic video indexing is the ability to identify both
abrupt transitions and gradual transitions. An abrupt transition is a discontinuous
transition between two images and is also referred to as a cut transition. Gradual
transitions include fade, dissolve, and wipe transitions. When an image gradually
disappears into black or white or gradually appears from black or white, a fade
transition occurs. When an image gradually disappears at the same time that another
image gradually appears, a dissolve transition occurs. When a first image gradually
blocks a second image, a wipe transition occurs. Gradual transitions are composite
shots that are created from more than one shot.

A shot transition detector must be sensitive to both abrupt transitions and
gradual transitions to be successful in an automatic video indexing system. The shot
transition detector should also be insensitive to other changes. In other words, the shot
transition detector should ignore small detail changes, image motion and camera
motion. For example, panning, zooming and tilting should not significantly impact the
query results.

Conventional video retrieval systems have employed several different types of
cut transition detection techniques including histogram difference, frame difference,
motion vector analysis, compression difference, and neural-network approaches.
Frame differencing detection systems are extremely sensitive to local motion.
Histogram detection systems successfully identify abrupt shot transitions and fades, but
work poorly on gradual transitions such as wipe and dissolve. Motion vector detection
systems require extensive computations that are prohibitive when large image
databases are used. Neural-network detection systems do not provide improved
performance over the other cut detection systems, and they require significant
computations for the neural-network training process.

Additional algorithms have been proposed that address gradual transitions such
as fade, dissolve, and wipe transitions. Edge tracking systems measure the relative
values of entering and exiting edge percentages. Edge tracking systems are able to
correctly identify less than 20% of gradual transitions. Edge tracking systems require
a motion estimation step to align consecutive frames, which is computationally
expensive. The performance of the edge tracking system is highly dependent upon the
accuracy of the motion estimation step. Chromatic scaling systems assume that fade in
transitions and fade out transitions are to and from black only. Chromatic scaling
systems also assume that both object and camera motion are low immediately before
and after the transition period.

A video segmentation system according to the invention includes a video source
that provides a video sequence with a plurality of frames. The video segmentation
system generates an S-distance measurement between adjacent frames of the video
sequence. The S-distance measurement gauges the similarity between the adjacent
frames.

A frequency decomposer that preferably employs wavelet decomposition
generates a low frequency and a high frequency signature for each frame. A cut
detector identifies cut transitions between two adjacent frames using the low frequency
signature. The cut detector generates a difference signal between coefficients of the low
frequency signature for adjacent frames and compares the difference signal to a
threshold. If the difference signal exceeds the threshold, a cut transition is declared.

After identifying the cut transitions, the video segmentation system according
to the invention employs a fade detector that identifies fade transitions using the high
frequency signatures for frames located between the cut transitions. The fade detector
includes a summing signal generator that sums the coefficients of the high frequency
signature for each frame and compares the sum signal to a linear signal which is an
increasing function for fade in and a decreasing function for fade out. A dissolve
transition detector employs the high frequency signature to identify potential dissolve
transitions. A double frame difference generator confirms the dissolve transitions. As
can be appreciated, the video segmentation system according to the invention
dramatically improves the identification of abrupt and gradual transitions. The video
segmentation system achieves segmentation in a computationally efficient manner.



An image retrieval system according to the invention also employs an
S-distance measurement to compare a query image to images located in a database. The
S-distance measurement is used to allow a user to search and browse in a manner
similar to text-based systems provided by the Internet.

BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 illustrates a video sequence that includes a plurality of shots; 
FIG. 2 illustrates multiple frames of a video sequence that are associated with a 
cut transition; 

FIG. 3 illustrates multiple frames of a video sequence that are associated with a 
fade transition; 

FIG. 4 illustrates multiple frames of a video sequence that are associated with a 
dissolve transition; 

FIG. 5 is a functional block diagram of an automatic video indexing system 
according to the invention; 

FIG. 6 is a functional block diagram illustrating a cut transition detector of
FIG. 5 in further detail;

FIG. 7 is a flow chart diagram for the cut transition detector of FIG. 6; 

FIG. 8 illustrates a fade detector of FIG. 5 in further detail; 

FIG. 9 is a flow chart diagram illustrating fade transition detection for the fade 
transition detector of FIG. 8; 

FIG. 10 is a functional block diagram illustrating a dissolve transition segments 
collector of FIG. 5 in further detail; 



FIG. 11 is a flow chart diagram illustrating the operation of the dissolve 
transition verifier; 

FIG. 12 is a functional block diagram of one embodiment of an image retrieval 
system; and 

FIG. 13 is a functional block diagram of a second embodiment of the image
retrieval system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Referring to FIG. 1, a video sequence 10 is illustrated and includes a plurality
of shots 12-1 to 12-n each including one or more frames 16-1 to 16-m. The video
sequence 10 includes n shots and m frames. An automatic video indexing system
according to the invention is preferably capable of identifying both abrupt and gradual
transitions between the n shots 12. After identifying the transitions between the n
shots, a key frame can be selected for each shot 12 for video indexing, retrieval and
other uses. The key frame can be a first frame in the shot, a middle frame or a
combination of frames. Not all transitions between the n shots are easy to identify.
For example, a transition between shot n-1 and shot n is a cut transition 20. A
transition between shot 1 and shot 2 is a dissolve transition 22.
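The relationship between detected transitions, shots, and key frames can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `shots_from_cuts` and `key_frame` are hypothetical, shots are represented as half-open frame index ranges, and only the "first frame" and "middle frame" choices named above are shown.

```python
from typing import List, Tuple


def shots_from_cuts(num_frames: int, cuts: List[int]) -> List[Tuple[int, int]]:
    """Split frame indices 0..num_frames-1 into shots.

    `cuts` holds the index of the first frame of each new shot.
    Each shot is returned as a half-open range (start, end).
    """
    bounds = [0] + sorted(cuts) + [num_frames]
    return [(bounds[i], bounds[i + 1])
            for i in range(len(bounds) - 1)
            if bounds[i] < bounds[i + 1]]


def key_frame(shot: Tuple[int, int], mode: str = "middle") -> int:
    """Pick a representative frame index: the first or the middle frame."""
    start, end = shot
    if mode == "first":
        return start
    return (start + end - 1) // 2


shots = shots_from_cuts(10, [4, 7])
keys = [key_frame(s) for s in shots]
```

With ten frames and new shots beginning at frames 4 and 7, the sketch yields three shots and one key frame index per shot.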

FIGS. 2-4 illustrate frames associated with both abrupt and gradual shot
transitions. Referring now to FIG. 2, a first frame 30 of a video sequence starting at
time t is followed by a second frame 32 starting at time t+1. Because of the abrupt
transition between the frames 30 and 32, a cut transition is designated at time t+1
(identified at 34) in FIG. 2.

FIG. 3 illustrates n frames of a fade out transition 40. Frame 44 occurs at time
t and an image 42 is readily distinguishable. Frame 46 occurs at time t+1 and the
visibility of image 42' is somewhat reduced relative to the image 42 in frame 44. At
time t+n-1, the visibility of the image 42" is further reduced until, at time t+n, the
image 42 generally disappears into a single color, such as black or white. A fade in
transition would be accomplished in reverse.

Referring now to FIG. 4, a dissolve transition 54 is illustrated and includes n
frames. At time t, the visibility of an image 56 in frame 58 is relatively high. At time
t+1, the visibility of the image 56' in frame 60 is reduced and a second image 62
becomes visible and has a relatively low visibility. At time t+n-1, the visibility of
image 56" in frame 66 has decreased and the visibility of the image 62' has increased.
At time t+n, the image 56 in frame 70 has disappeared and the visibility of the image
62" has increased.

Referring now to FIG. 5, an automatic video indexing system 80 for detecting
abrupt and gradual transitions is illustrated. The automatic video indexing system 80
includes a processor 84 that is connected to memory 86 and an input/output interface
90. The memory 86 includes read only memory (ROM), random access memory
(RAM), optical storage, hard drives, and/or other suitable storage. The automatic
video indexing system 80 includes a source of video sequences such as a local video
(image) database 92 or distributed video (image) databases such as a video (image)
database 94 that is available through a local area network (LAN) 96 or a video (image)
database 98 that is available through a wide area network (WAN) 100 which can be
connected to the Internet. The automatic video indexing system 80 also includes
input/output (I/O) devices 104 such as a keyboard, a mouse, one or more displays, an
image scanner, a printer, and/or other I/O devices.



A video sequence selector 110 allows the user to select one or more video
sequences 10 that may be stored in the video (image) databases 92, 94 and/or 98.
Video sequence selection can be performed in a conventional manner through dialog
boxes, which are navigated using mouse and/or keyboard selections. An image
extractor 114 extracts a thumbnail direct current (DC) image for each frame of a
selected video sequence. A frequency decomposer 116 is connected to the image
extractor 114 and generates a frequency domain decomposition of each thumbnail DC
image. The frequency decomposer can employ fast Fourier transform (FFT), discrete
cosine transform (DCT), discrete Fourier transform (DFT), or wavelet transformation.
Due to the computational efficiency of wavelet transforms such as Haar wavelet
transforms, wavelet transformation is preferred.
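As an illustration of why a Haar decomposition is inexpensive, the following Python sketch applies one level of a two-dimensional Haar transform using only additions and a division by four per 2 by 2 block. It is a minimal sketch, not the decomposer 116 itself: the function name `haar_2d_level` is hypothetical, the averaging (unnormalized) form of the transform is assumed, and the input is assumed to have even height and width.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def haar_2d_level(img: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Apply one level of an averaging 2D Haar transform.

    Returns (low, horiz, vert, diag). `low` is the low frequency band;
    the other three are high frequency detail bands.
    """
    low, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        low_row, horiz_row, vert_row, diag_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            low_row.append((a + b + d + e) / 4.0)    # local average
            horiz_row.append((a - b + d - e) / 4.0)  # left minus right
            vert_row.append((a + b - d - e) / 4.0)   # top minus bottom
            diag_row.append((a - b - d + e) / 4.0)   # diagonal difference
        low.append(low_row)
        horiz.append(horiz_row)
        vert.append(vert_row)
        diag.append(diag_row)
    return low, horiz, vert, diag
```

Applying the function again to the returned `low` band gives the next, coarser level of the decomposition.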

When Motion Picture Experts Group (MPEG) video sequence sources are
employed, they typically have a frame size of 512 by 512 pixels or larger. Generally
the DC image is generated for 8 by 8 pixel blocks. A typical thumbnail MPEG image
is therefore 512/8 by 512/8, or 64 by 64, pixel blocks or larger. Decomposition of the
MPEG thumbnail frame images using wavelet transforms has been found to be a
sufficient input for generating high and low frequency domain components. This
technique is advantageous since the thumbnail DC image is much easier to extract from
the MPEG video as compared to AC coefficients. By employing only the thumbnail DC
components, lower computational time is required.
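The size arithmetic above (512/8 = 64) can be illustrated with a short Python sketch that reduces a frame to one value per 8 by 8 block. It rests on two assumptions: the frame is available as decoded pixel intensities rather than as an MPEG bitstream, and the block mean is used as a stand-in for the DC term (the DC coefficient of a block DCT is proportional to the block mean). The function name `dc_thumbnail` is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def dc_thumbnail(frame: Matrix, block: int = 8) -> Matrix:
    """Return one mean value per block x block tile of the frame.

    Partial tiles at the right and bottom edges are ignored.
    """
    rows, cols = len(frame), len(frame[0])
    thumb = []
    for r in range(0, rows - rows % block, block):
        thumb_row = []
        for c in range(0, cols - cols % block, block):
            total = sum(frame[r + i][c + j]
                        for i in range(block)
                        for j in range(block))
            thumb_row.append(total / (block * block))
        thumb.append(thumb_row)
    return thumb
```

A 512 by 512 frame passed to this function produces a 64 by 64 thumbnail, which is the input size discussed in the text.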

A low frequency component (LFC) signature generator 120 generates an LFC
signature for each image. A high frequency component (HFC) signature generator 124
generates an HFC signature. In one embodiment, wavelet transformation is employed
to generate the LFC and HFC signatures. The HFC and LFC signatures are generated
as follows:

$F$ is a representation of the host content $\mathcal{I}$. $F'$ is a function of $F$
after processing such as compression or blurring. If automatic video indexing and
image retrieval are desired, then the following need to be identified:

$$\frac{dF}{dt} = 0 \quad \text{or} \quad F(t+1) - F(t) = 0 \quad \text{for video segmentation; and}$$

$$F(Q) - F(S_n) = 0$$

for retrieval of query image $Q$ from the database containing images
$S_1, \ldots, S_n, \ldots, S_j$.

$S$ represents a wavelet transform of $\mathcal{I}$. $S'$ is the image $\mathcal{I}$ in
the wavelet domain with the small coefficients set to zero. Studies on visual data
compression indicate that visually $S' - S \approx 0$ when the small coefficients of the
wavelet transformation of image $\mathcal{I}$ are discarded. Suppose $F$ is a feature
extracted from $S'$ that invariantly preserves the visual content of $\mathcal{I}$. Then
visually we have $F(S') - F(S) \approx 0$. $F$ can therefore be used as a discriminant
function such that

$$F(S'_a) - F(S'_b) > F(S'_a) - F(S_a),$$

where $S_a$ and $S_b$ are two different images (frames) that are visually different.

For video segmentation and image retrieval, the overall content change
between two images or two video frames needs to be measured. The measurement
should reflect the overall structures of two images or frames. S-distance is a
measurement of the distance between two images or frames in the wavelet domain.
S-distance gives a measurement of how many significant LFCs and/or HFCs of two
images are in common. As a result, S-distance provides a good measurement of the
overall similarities and/or differences between two images or frames and can be
used for browsing and searching images as will be described further below.



$V_t$ ($t \in [0,n]$) represents a frame $t$ of video sequence $V$ and $\mathcal{I}$
represents an image. $v^1$ and $v^2$ represent shot 1 and shot 2. The image size is $X$
by $Y$. $I_t(x,y)$ is the intensity of the $(x,y)$th coefficient of frame $t$ and
$I(x,y)$ is the intensity of the $(x,y)$th coefficient of image $\mathcal{I}$, where
$x \in [1,X]$ and $y \in [1,Y]$.

To define S-distance, wavelet transformation is performed on two images.
For example, the two images can be the query and the target images for image
retrieval or two consecutive frames for video segmentation. The wavelet coefficient
of image $I_t(x,y)$ is denoted as $I'_t(x,y)$. The LFC signature $S_L$ and the HFC
signature $S_H$ of each frame/image are defined as follows:

$$S_L(I_t) = \{S(I'_t(x,y))\} =
\begin{bmatrix}
S(I'_t(0,0)) & S(I'_t(1,0)) & \cdots \\
S(I'_t(0,1)) & \cdots & \\
\vdots & &
\end{bmatrix},
\quad (x,y) \in V_{tL} \text{ for video frames, and}$$

$$S_L(\mathcal{I}) = \{S(I'(x,y))\} =
\begin{bmatrix}
S(I'(0,0)) & S(I'(1,0)) & \cdots \\
S(I'(0,1)) & \cdots & \\
\vdots & &
\end{bmatrix},
\quad (x,y) \in S_L \text{ for images;}$$

$$S_H(I_t) = \{S(I'_t(x,y))\} =
\begin{bmatrix}
S(I'_t(x',y')) & S(I'_t(x'+1,y')) & \cdots \\
S(I'_t(x',y'+1)) & \cdots & \\
\vdots & &
\end{bmatrix},
\quad (x,y) \in V_{tH} \text{ for video frames, and}$$

$$S_H(\mathcal{I}) = \{S(I'(x,y))\} =
\begin{bmatrix}
S(I'(x',y')) & S(I'(x'+1,y')) & \cdots \\
S(I'(x',y'+1)) & \cdots & \\
\vdots & &
\end{bmatrix},
\quad (x,y) \in S_H \text{ for images,}$$

where $S(I(x,y)) = n$ when $\varepsilon_{n-1} < I(x,y) < \varepsilon_n$, and
$n = 0, 1, 2, \ldots$. $S_L$ ($S_H$) represents the low (high) frequency subband, as
does $V_{tL}$ ($V_{tH}$). Notice here that $I(x,y)$ is the single channel of a
multi-channel intensity function. After finding $F$ (the LFC and HFC signatures),
which is the discriminant signature of $\mathcal{I}$ and is also a feature extracted
from $S'$, algorithms that utilize this feature can be used for video processing
applications as described further below.
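The quantization function S, which maps each coefficient to the index n of the interval it falls in, can be sketched in Python as follows. This is an illustrative sketch under assumptions: the threshold values are made up for the example, the names `quantize` and `signature` are hypothetical, and a coefficient exactly equal to a threshold is assigned to the lower interval (the text does not specify the boundary case).

```python
from bisect import bisect_left
from typing import List

Matrix = List[List[float]]


def quantize(value: float, thresholds: List[float]) -> int:
    """Return n such that thresholds[n-1] < value <= thresholds[n].

    `thresholds` must be sorted in increasing order. Values above the
    last threshold map to len(thresholds).
    """
    return bisect_left(thresholds, value)


def signature(subband: Matrix, thresholds: List[float]) -> List[List[int]]:
    """Quantize every coefficient of a wavelet subband."""
    return [[quantize(v, thresholds) for v in row] for row in subband]


# Example thresholds (assumed values, chosen only for illustration).
example = signature([[0.5, 2.0], [5.0, 20.0]], [1.0, 4.0, 16.0])
```

Applying `signature` to the low frequency subband gives an LFC signature, and applying it to a high frequency subband gives an HFC signature in the sense used above.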

Referring back to FIG. 5, a cut transition detector 130 that is connected to the
LFC generator 120 identifies the cut transitions in a video sequence. A fade detector
134 that is connected to the HFC generator 124 identifies starting and ending points of
the fade transitions in the video sequence. A dissolve segments collector 138 is
connected to the HFC generator 124 and identifies potential starting and ending points
of the dissolve transitions of the video sequence.

A dissolve transition verifier 142 confirms the existence of the potential
dissolve transitions identified by the dissolve segments collector 138 using a double
frame differencing (DFD) algorithm. A segmented data generator 142 collects the cut
transition data, the fade transition data, and the dissolve transition data identified by the
cut detector 130, the fade detector 134, and the dissolve transition verifier 142. The
segmented data generator 142 transmits the transition data to the interface 90 for
storage within the video databases 92, 94, 98, or for transmission to other computers
or I/O devices 104.

Referring now to FIG. 6, the cut transition detector 130 is connected to the
LFC generator 120 and includes an S-distance difference generator 150, a cut
threshold generator 152, and a comparator 154. A cut transition data collector 154
collects the cut transitions for the video sequence. A smoothing filter (not shown) may
be used on an output of the S-distance difference generator 150 or the comparator 154
if desired.

Referring to FIG. 7, the operation of the cut detector 130 is illustrated in
further detail. In step 160, a value for the cut threshold is set. In step 162, the
S-distance function difference is calculated.

To calculate S-distance, weighting functions are applied to the LFC and the
HFC signatures. Then, the S-distance function difference is calculated for the frames
of the video sequence. The S-distance difference is calculated for pairs of consecutive
frames, such as frame $t$ and frame $t+1$. S-distance measures the distance between two
images or two consecutive frames of the video sequence by taking the difference
between LFC and HFC signatures after weighting functions are applied:

$$S(t,t+1) = \Omega_L S_L(t,t+1) + \Omega_H S_H(t,t+1)
= \Omega_L\,|S_L(t+1), S_L(t)| + \Omega_H\,|S_H(t+1), S_H(t)|$$

where $\Omega_L$ and $\Omega_H$ are weighting functions.

When identifying cut sequences, the high frequency signature components
and/or the weighting function $\Omega_H$ are set to 0. For any consecutive frames, if
$S(t,t+1) > d$, where $d$ is the cut threshold, then a cut transition is observed from
frame $t$ to frame $t+1$ (or alternatively from $T_0 = t$ to $T_N = t+1$); otherwise no
cut transition is observed.
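A weighted S-distance of this form can be sketched in Python as shown below. The sketch makes an explicit assumption that is not stated in the text: the distance between two signatures is taken to be the sum of absolute differences of their quantized coefficients. The names `signature_distance`, `s_distance`, `w_low`, and `w_high` are hypothetical; `w_low` and `w_high` play the role of the two weighting functions.

```python
from typing import List

Signature = List[List[int]]


def signature_distance(sig_a: Signature, sig_b: Signature) -> int:
    """Sum of absolute differences between two equally sized signatures.

    Assumed metric, used here only for illustration.
    """
    return sum(abs(a - b)
               for row_a, row_b in zip(sig_a, sig_b)
               for a, b in zip(row_a, row_b))


def s_distance(lfc_a: Signature, hfc_a: Signature,
               lfc_b: Signature, hfc_b: Signature,
               w_low: float = 1.0, w_high: float = 1.0) -> float:
    """Weighted combination of the LFC and HFC signature distances."""
    return (w_low * signature_distance(lfc_a, lfc_b)
            + w_high * signature_distance(hfc_a, hfc_b))
```

Setting `w_high` to 0 reduces the measure to the low frequency term alone, which corresponds to the configuration described for cut detection.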

In step 164, the S-distance function difference for frames t and t+1 is
compared to the cut threshold. If the S-distance function difference exceeds the cut
threshold as determined at step 166, a cut transition is declared at step 168. If not, a
cut transition is not declared at step 170. Additional pairs of frames for t+2, t+3,
..., t+n are handled similarly.
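The loop of steps 164 to 170 can be sketched in Python as a pass over consecutive frame pairs. As before, this is a sketch under assumptions: the per-pair difference is taken to be the sum of absolute differences of the LFC signatures (the high frequency weight is treated as zero), and the name `detect_cuts` is hypothetical.

```python
from typing import List

Signature = List[List[int]]


def detect_cuts(lfc_signatures: List[Signature], threshold: float) -> List[int]:
    """Return the indices t+1 at which a cut is declared.

    A cut is declared between frames t and t+1 when the difference of
    their LFC signatures exceeds the cut threshold.
    """
    cuts = []
    for t in range(len(lfc_signatures) - 1):
        current, following = lfc_signatures[t], lfc_signatures[t + 1]
        diff = sum(abs(a - b)
                   for row_a, row_b in zip(current, following)
                   for a, b in zip(row_a, row_b))
        if diff > threshold:
            cuts.append(t + 1)  # first frame of the new shot
    return cuts


frames = [[[1, 1]], [[1, 1]], [[5, 5]], [[5, 6]]]
found = detect_cuts(frames, threshold=3)
```

In this small example only the jump between the second and third signatures exceeds the threshold, so a single cut is reported.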

Referring now to FIG. 8, the fade transition detector 134 is illustrated in
further detail. The fade transition detector 134 is connected to the high frequency
component signal generator 124 and includes a summing signal generator 180, a linear
function generator 184, and a comparing circuit 188. Because the cut transition
detector 130 has identified cut transitions, the fade detector 134 analyzes only the
sub-sequences of the video between two consecutive cuts.

The changing characteristics of the video frames within fade and dissolve
transitions can be modeled as:

$$E(t) = F(v^1)\,\eta(t) + F(v^2)\,(1 - \eta(t)) + C, \quad t \in (t_0, t_N)$$

where $E(t)$ is a characteristic function; $F(v^1)$ and $F(v^2)$ represent unedited
moving image sequence characteristic functions of two consecutive shots; $\eta(t)$ is a
decreasing function with $\eta(t_0) = 1$ and $\eta(t_N) = 0$; $C$ is the constant (or
background) image such as text, label or logo which exists in all frames within a shot
transition; and $t_0$, $t_N$ are the starting and ending points of a transition.

During a fade out, the second sequence is absent and $F(v^2) = 0$ for all
$t \in (t_0, t_N)$. In a fade in, $F(v^1) = 0$ for all $t \in (t_0, t_N)$, i.e.,

$$E_{\text{fade-in}}(t) = F(v^2)\,(1 - \eta(t)) + C$$

During a dissolve, both sequences are present, so that neither $F(v^1)$ nor $F(v^2)$ is
equal to zero:

$$E_{\text{dissolve}}(t) = F(v^1)\,\eta(t) + F(v^2)\,(1 - \eta(t)) + C$$
Examples of the changing characteristic functions $E(t)$ include the changing intensity
function $I(x,y,t)$ and the edge intensity function $G(x,y,t)$:

$$E(x,y,t) = I_1(x,y,t)\,\eta(t) + I_2(x,y,t)\,(1 - \eta(t)) + I_c(x,y)$$

$$E(x,y,t) = G_1(x,y,t)\,\eta(t) + G_2(x,y,t)\,(1 - \eta(t)) + G_c(x,y)$$

where $I_1$ and $I_2$ define the intensity of the first and second unedited moving image
sequences; $G_1$ and $G_2$ are the pixel intensity functions of the corresponding edge
image sequences of image sequences $I_1$ and $I_2$; and $I_c$ and $G_c$ represent the
pixel intensity function of the constant image and the constant edge image,
respectively. Notice that in the above equation $G_k(x,y,t) = 0$ when $(x,y)$ is not an
edge point. Hence only the edge points in the edge image will contribute to this
characteristic function.
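The intensity form of the transition model can be illustrated by synthesizing the frames of a transition in Python. The sketch assumes a linear decreasing function for η(t), which is one choice consistent with the stated endpoint conditions, and the names `eta` and `blend` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def eta(t: float, t0: float, tn: float) -> float:
    """Linear decreasing function with eta(t0) = 1 and eta(tn) = 0 (assumed form)."""
    return (tn - t) / (tn - t0)


def blend(frame1: Matrix, frame2: Matrix, constant: Matrix,
          t: float, t0: float, tn: float) -> Matrix:
    """Evaluate E = I1 * eta + I2 * (1 - eta) + Ic pixel by pixel."""
    w = eta(t, t0, tn)
    return [[p1 * w + p2 * (1.0 - w) + pc
             for p1, p2, pc in zip(row1, row2, row_c)]
            for row1, row2, row_c in zip(frame1, frame2, constant)]


# A fade out: the second sequence is absent (all zeros).
start = blend([[100.0]], [[0.0]], [[10.0]], t=0, t0=0, tn=4)
middle = blend([[100.0]], [[0.0]], [[10.0]], t=2, t0=0, tn=4)
end = blend([[100.0]], [[0.0]], [[10.0]], t=4, t0=0, tn=4)
```

Passing a nonzero second frame turns the same call into a dissolve, and swapping the roles of the two frames gives a fade in.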

Based on the number of frames in a shot to be analyzed for a fade transition,
the values for a decreasing function and a weighting function used for fade transition
detection are set by the linear function generator 184 in step 190 in FIG. 9. At step
194, the high frequency coefficients for each frame are summed:

$$S_H(t) = |S_H(t)| = \sum_{(x,y)} S(I'_t(x,y)), \quad (x,y) \in V_{tH}.$$

At step 196, the difference circuit 188 generates a difference between the decreasing
function output by the linear function generator 184 and the sum output by the
summing signal generator 180 for each frame. If the difference is approximately zero
for each frame in the shot sequence, as determined at step 200, i.e., if
$S_H(t) - S_H(T_0)\,\eta(t) \approx 0$, $t \in [T_0, T_N]$, then a fade-out is declared
at step 202. If not, the difference circuit subtracts one minus the decreasing function
from the sum at step 204. If the difference is approximately equal to zero for each
frame in the shot as determined at step 208, i.e., if
$S_H(t) - S_H(T_N)\,(1 - \eta(t)) \approx 0$, $t \in [T_0, T_N]$, then a fade-in is
declared at step 210. Otherwise, neither a fade-in transition nor a fade-out transition
is declared at step 212.
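The comparison of the summed HFC signal against a linear model can be sketched in Python as follows. Several details are assumptions made for illustration: the decreasing function is linear over the segment, "approximately zero" is interpreted as a tolerance relative to the larger endpoint value, a segment whose endpoints are nearly equal is rejected outright, and the name `classify_fade` is hypothetical.

```python
from typing import List


def classify_fade(hfc_sums: List[float], tolerance: float = 0.1) -> str:
    """Classify a segment as 'fade-out', 'fade-in', or 'none'.

    hfc_sums[t] is the summed high frequency signature of frame t,
    with t = 0 the segment start and t = len - 1 the segment end.
    """
    n = len(hfc_sums) - 1
    if n < 1:
        return "none"
    start, end = hfc_sums[0], hfc_sums[-1]
    scale = max(abs(start), abs(end), 1e-12)
    if abs(start - end) <= tolerance * scale:
        return "none"  # no overall change, so no fade

    def fits(model) -> bool:
        return all(abs(s - model(t)) <= tolerance * scale
                   for t, s in enumerate(hfc_sums))

    if fits(lambda t: start * (n - t) / n):   # S_H(T0) * eta(t)
        return "fade-out"
    if fits(lambda t: end * t / n):           # S_H(TN) * (1 - eta(t))
        return "fade-in"
    return "none"
```

The function checks the fade-out model first and the fade-in model second, mirroring the order of steps 200 and 208.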

Referring now to FIG. 10, the dissolve transition segments collector 138 is
illustrated in further detail. The dissolve transition segments collector 138 is connected
to the HFC generator 124. The high frequency coefficients for each frame are
summed, $S_H(t) = |S_H(t)| = \sum_{(x,y)} S(I'_t(x,y))$. An ideal dissolve signal
generator 224 generates the ideal dissolve function that is a changing statistical
function. A smoothing filter 228 smoothes the summed HFC. A difference circuit
229 generates a difference between the output of the ideal dissolve signal generator 224
and the filtered and summed HFC output. If the difference is approximately zero as
determined at 230, i.e., if $S_H(t) - S_H(T_0)\,\eta(t) \approx 0$, $t \in [T_0, T_{N/2}]$
and $S_H(t) - S_H(T_0)\,(1 - \eta(t)) \approx 0$, $t \in [T_{N/2}, T_N]$, then $T_0$ and
$T_N$ are declared as potential starting and ending points of a dissolve. Experimental
results show that the HFC more accurately predicts fades and dissolves as compared to
color histogram, frame differencing, and motion vector analysis. Generally $S_H(t)$ of
a dissolve transition is "U"-shaped, with the center of the "U" being a local minimum
identifying a mid-point of a potential dissolve transition and local maxima on both
sides thereof identifying starting and ending points of the potential dissolve transition.
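The "U"-shape search can be sketched in Python as a scan for local minima flanked by local maxima. This is a simplified stand-in for the comparison against an ideal dissolve function: a moving average is assumed for the smoothing filter, a candidate is accepted when the dip is at least `min_depth` deep, and the names `moving_average`, `dissolve_candidates`, and `min_depth` are hypothetical.

```python
from typing import List, Tuple


def moving_average(values: List[float], window: int = 3) -> List[float]:
    """Smooth a series with a centered window that shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def dissolve_candidates(hfc_sums: List[float],
                        min_depth: float) -> List[Tuple[int, int, int]]:
    """Return (start, mid, end) index triples of U-shaped dips."""
    candidates = []
    n = len(hfc_sums)
    for m in range(1, n - 1):
        if not (hfc_sums[m - 1] > hfc_sums[m] <= hfc_sums[m + 1]):
            continue  # not a local minimum
        left = m
        while left > 0 and hfc_sums[left - 1] > hfc_sums[left]:
            left -= 1   # climb to the local maximum on the left
        right = m
        while right < n - 1 and hfc_sums[right + 1] > hfc_sums[right]:
            right += 1  # climb to the local maximum on the right
        depth = min(hfc_sums[left], hfc_sums[right]) - hfc_sums[m]
        if depth >= min_depth:
            candidates.append((left, m, right))
    return candidates
```

Each returned triple gives the potential starting point, mid-point, and ending point that would then be passed on for verification.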

Referring now to FIGS. 10 and 11, the dissolve transition verifier 142 is
illustrated in further detail. The dissolve transition verifier 142 is connected to the
dissolve transition segments collector 138 and receives the potential starting and ending
points of dissolve transitions therefrom. The dissolve transition verifier 142 is
connected to the LFC signature generator 120 and includes a double frame difference
(DFD) generator 250 which is connected to the output of the dissolve transition
segments collector 138.

An ideal dissolve has a "V"-shaped intensity function and has no local motion
or camera motion in the sequence. The change of intensity of the first shot has a
negative slope and is linear. There exists a frame $i_k$ with its intensity $I(x,y,i_k)$
equal to the average intensity of the starting and ending frames ($I(x,y,i_0)$ and
$I(x,y,i_N)$) of the dissolve when $N = 2m + 1$. That is,

$$I(x,y,i_k) = \frac{I(x,y,i_0) + I(x,y,i_N)}{2}$$

(Note that when $N = 2m$ ($m$ is an integer), $i_k$ is then a pseudo frame.) The DFD of
frame $i_d$ of a moving image sequence $I$ is defined as the accumulation of a pixel by
pixel comparison between this average and the intensity of frame $i_d$, where $i_d$ is a
frame in a potential dissolve transition segment:

$$\mathrm{DFD}(i_d) = \sum_{(x,y)} f\!\left(\left|\frac{I(x,y,i_0) + I(x,y,i_N)}{2} - I(x,y,i_d)\right|\right)$$

where $f$ denotes the pixel by pixel comparison.

The dissolve transition verifier 142 further includes a smoothing filter 254
which smoothes the output of the DFD signal. A verified dissolve transition collector
256 stores the verified dissolve transition data for a video sequence.

Referring to FIG. 11, at step 260, the DFD signal generator 250 computes the
DFD signal on the LFC signature for the starting and ending points provided by the
dissolve segments collector 138. At step 264, the smoothing filter 254 filters the data
provided by the DFD signal generator 250. At step 266, the slope of the DFD signal
is used to identify whether the DFD signal is concave (i.e., if
$\mathrm{DFD}(t) - \mathrm{DFD}(T_0)\,\eta(t) \approx 0$, $t \in [T_0, T_{N/2}]$, and
$\mathrm{DFD}(t) - \mathrm{DFD}(T_N)\,(1 - \eta(t)) \approx 0$, $t \in [T_{N/2}, T_N]$)
and whether the depth of the concavity exceeds a threshold. If both are present, a
dissolve transition is declared at step 268. If one or both are not present, then a
dissolve transition is not declared at step 269.

As can be appreciated, the automatic video indexing system 80 automatically
indexes video sequences with a high probability of identification of both abrupt and
gradual shot transitions. Furthermore, a key frame can be selected from each shot for
image retrieval and shot summary by selecting a first frame, an intermediate frame, or
a combination of frames.
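The double frame difference and the shape test used for verification can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: the pixel by pixel comparison is taken to be a plain absolute difference, the "V" shape is tested by requiring the curve to fall monotonically to an interior minimum and then rise monotonically, and the names `dfd`, `dfd_curve`, `is_dissolve`, and `min_depth` are hypothetical. The inputs may be intensity images or LFC signatures of equal size.

```python
from typing import List

Matrix = List[List[float]]


def dfd(frame_start: Matrix, frame_end: Matrix, frame_d: Matrix) -> float:
    """Accumulate |average(start, end) - frame_d| over all positions."""
    total = 0.0
    for row0, row_n, row_d in zip(frame_start, frame_end, frame_d):
        for p0, pn, pd in zip(row0, row_n, row_d):
            total += abs((p0 + pn) / 2.0 - pd)
    return total


def dfd_curve(frames: List[Matrix]) -> List[float]:
    """DFD of every frame in a candidate segment."""
    return [dfd(frames[0], frames[-1], frame) for frame in frames]


def is_dissolve(curve: List[float], min_depth: float) -> bool:
    """Accept a curve that falls to an interior minimum, then rises."""
    mid = min(range(len(curve)), key=curve.__getitem__)
    falling = all(curve[i] >= curve[i + 1] for i in range(mid))
    rising = all(curve[i] <= curve[i + 1] for i in range(mid, len(curve) - 1))
    depth = min(curve[0], curve[-1]) - curve[mid]
    return falling and rising and 0 < mid < len(curve) - 1 and depth >= min_depth


# Five frames of an ideal linear dissolve between two one-row images.
segment = [[[0.0, 100.0]], [[25.0, 75.0]], [[50.0, 50.0]],
           [[75.0, 25.0]], [[100.0, 0.0]]]
curve = dfd_curve(segment)
```

For the ideal linear dissolve in the example, the DFD curve is largest at the two ends and zero at the middle frame, which is the shape the verifier looks for; a static segment produces a flat curve and is rejected.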

Referring now to FIG. 12, an image retrieval system 300 is illustrated.
Reference numbers from FIG. 5 have been utilized to identify similar elements in FIG.
12. The image retrieval system 300 includes a query image capture device 310 which
employs I/O devices 104. The query image can be input using I/O devices 104 such as
a scanner for capturing a photograph or sketch. Drawing software associated with
processor 84 and memory 86 may also be used to input a sketch. Alternately, the
query image can be selected on the Internet, input using portable storage media, stored
on a hard drive or selected from the image databases 92, 94, and/or 98. Other suitable
query image sources will be apparent to skilled artisans. The query image capture
device 310 is connected to the frequency decomposer 322 which provides frequency
decomposition of the query image using wavelet transformation, DFT, DCT, FFT, or
other suitable frequency domain transformation. Preferably, however, wavelet
decomposition using Haar transformation is employed.

The output of the frequency decomposer 322 is connected to the LFC generator
120 and the HFC generator 124. An image retrieving device 320 retrieves images for
comparison to the query image from at least one of the image databases 92, 94, and/or
98. The image retrieving device 320 outputs the images to the frequency decomposer
322 which similarly performs wavelet transformation, DCT, DFT, FFT, or other
suitable frequency domain transformation.

The output of the frequency decomposer 116 is input to the LFC generator 120
and the HFC generator 124. The outputs of the LFC generators 314 and 324 are input
to an LFC weighting device 330. The outputs of the HFC generators 316 and 326 are
input to an HFC weighting device 340. After suitable weighting is applied, an
S-distance generator 342 generates the S-distance measurement.

The S-distance measurement performed on frames $t$ and $t+1$ can be used on
the query image and database image. $S(t,t+1)$ is replaced by $S(Q,S_n)$, where $Q$
represents the query image and $S_n$ represents the $n$th image in the database.

$$S(Q,S_n) = \Omega_L S_L(Q,S_n) + \Omega_H S_H(Q,S_n)
= \Omega_L\,|S_L(Q), S_L(S_n)| + \Omega_H\,|S_H(Q), S_H(S_n)|$$

$$= \Omega_L\,|S_L(Q) - S_L(S_n)| + \Omega_H\,|S_H(Q) - S_H(S_n)|$$

The images with the least S-distance measurement to the query image Q can then be
returned in order of highest to lowest similarity as retrieval results in a manner similar
to text-based browsing and searching.
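Ranking database images by their S-distance to a query can be sketched in Python as shown below. The sketch assumes that LFC and HFC signatures have already been computed for the query and for every database image, that the distance between two signatures is the sum of absolute differences, and that the database is a simple in-memory dictionary. The names `rank_images`, `signature_distance`, `w_low`, and `w_high` are hypothetical.

```python
from typing import Dict, List, Tuple

Signature = List[List[int]]
SignaturePair = Tuple[Signature, Signature]  # (LFC signature, HFC signature)


def signature_distance(sig_a: Signature, sig_b: Signature) -> int:
    """Sum of absolute differences (assumed metric)."""
    return sum(abs(a - b)
               for row_a, row_b in zip(sig_a, sig_b)
               for a, b in zip(row_a, row_b))


def rank_images(query: SignaturePair,
                database: Dict[str, SignaturePair],
                w_low: float = 1.0,
                w_high: float = 1.0) -> List[Tuple[str, float]]:
    """Return (name, S-distance) pairs, most similar image first."""
    q_lfc, q_hfc = query
    scored = []
    for name, (lfc, hfc) in database.items():
        dist = (w_low * signature_distance(q_lfc, lfc)
                + w_high * signature_distance(q_hfc, hfc))
        scored.append((name, dist))
    return sorted(scored, key=lambda item: item[1])


db = {
    "a": ([[1, 1]], [[0, 0]]),
    "b": ([[5, 5]], [[3, 0]]),
    "c": ([[1, 2]], [[0, 0]]),
}
results = rank_images(([[1, 1]], [[0, 0]]), db)
```

The smallest S-distance corresponds to the highest similarity, so the first entry of the returned list is the best match for the query.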

As can be appreciated, the query image is compared to multiple images from
the image databases 92, 94, and/or 98 and the S-distance measurement defines the
relative similarity between the query image and the database image. Subsequently, the
processor 84 and memory 86 arrange the query results in order of highest to lowest
similarity and output the query results to one of the I/O devices 104 for selection by
the user.

Referring now to FIG. 13, a second embodiment of the image retrieval system
is illustrated at 350. A query image capture device 352 captures a query image as
described above. An image retriever 354 retrieves images for comparison with the
query image. Depending on how the image is stored, the output of the image
retrieving device 354 is input to a frequency decomposer 356, to an LFC signal
generator 358 and an HFC signal generator 360, or to an LFC and HFC weighting device
364 as indicated by dotted line 365. Processing of the S-distance measurement is
similar to that described above with respect to FIG. 12. By eliminating some of the
processing on the database images, computational efficiency can be improved.

From the foregoing, it will be understood that the invention provides an
image retrieval system that generates a list of possible database images that match a
query image based upon the similarity between the database image and the query
image. Skilled artisans can appreciate that while discrete functional blocks have
been identified in FIGS. 5, 6, and 8, these functions can be combined into larger
functional blocks which perform multiple functions. The image retrieval system
allows large databases to be searched for images. Browsing and searching extensive
image databases is dramatically simplified.

While the invention has been described in its presently preferred embodiments,
it will be understood that the invention is capable of certain modifications and changes
without departing from the spirit of the invention as set forth in the appended claims.

CLAIMS

What is claimed is:

1. A video segmentation system comprising: 

a video source that provides a video sequence that includes a plurality 
of frames each including multiple pixels; 

a frequency decomposer connected to said video source that generates 
a low frequency signature for each of said plurality of frames; and 

a cut detector connected to said video source and said frequency 
decomposer that identifies a cut transition between two adjacent frames using said 
low frequency signature. 

2. The video segmentation system of Claim 1 wherein said low 
frequency signature includes a first set of x by y coefficients.

3. The video segmentation system of Claim 1 wherein said video 
sequence is in a compressed format, wherein each of said frames includes a plurality 
of blocks that include multiple pixels and wherein each of said blocks has a direct 
current (DC) luminance signal and an alternating current (AC) luminance signal. 

4. The video segmentation system of Claim 1 wherein said frequency 
decomposer employs at least one of wavelet decomposition, discrete Fourier 
transformation (DFT), and discrete cosine transformation. 

5. The video segmentation system of Claim 4 wherein said frequency
decomposer employs wavelet decomposition using a Haar transform.
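
By way of non-limiting illustration, one level of a two-dimensional Haar decomposition can be sketched as follows. The sketch assumes the averaging form of the Haar transform (pairwise means and half-differences) applied to a small array of luminance values with even dimensions, and it assumes that the low frequency signature corresponds to the averaged band while the high frequency signature is drawn from the detail bands. The function name `haar_level` is hypothetical.

```python
def haar_level(image):
    """One level of a 2-D Haar decomposition using pairwise averages.

    `image` is a 2-D list with an even number of rows and columns.
    Returns (ll, (lh, hl, hh)): the low frequency band and three detail bands.
    """
    # Horizontal pass: mean and half-difference of neighbouring columns.
    lo = [[(r[j] + r[j + 1]) / 2.0 for j in range(0, len(r), 2)] for r in image]
    hi = [[(r[j] - r[j + 1]) / 2.0 for j in range(0, len(r), 2)] for r in image]

    def vertical(band, sign):
        # Vertical pass: combine neighbouring rows (sign=+1 mean, sign=-1 difference).
        return [[(a + sign * b) / 2.0 for a, b in zip(band[i], band[i + 1])]
                for i in range(0, len(band), 2)]

    ll, lh = vertical(lo, 1), vertical(lo, -1)
    hl, hh = vertical(hi, 1), vertical(hi, -1)
    return ll, (lh, hl, hh)


low_band, detail_bands = haar_level([[1, 3], [5, 7]])
```

Applying `haar_level` again to the returned low frequency band would yield a coarser signature with fewer coefficients; how many levels are applied is a design choice not fixed by this sketch.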

6. The video segmentation system of Claim 2 wherein said cut detector 
includes: 

a cut threshold generator that generates a cut threshold signal;

a difference signal generator connected to said frequency decomposer
that generates a difference signal by comparing said first set of x by y coefficients
for a first frame with said first set of x by y coefficients for a second frame that is
adjacent said first frame; and

a comparator connected to said cut threshold generator and said
difference signal generator that identifies a cut transition between said two adjacent
frames if said difference signal exceeds said cut threshold signal.
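
By way of non-limiting illustration, the thresholded comparison of adjacent low frequency signatures can be sketched as follows. The sketch assumes that the difference signal is a sum of absolute coefficient differences and that the cut threshold is a single fixed value supplied by the caller; a threshold generator could equally produce an adaptive value. The names `signature_difference` and `detect_cuts` are hypothetical.

```python
def signature_difference(sig_a, sig_b):
    """Assumed difference signal: sum of absolute coefficient differences."""
    return sum(abs(a - b)
               for row_a, row_b in zip(sig_a, sig_b)
               for a, b in zip(row_a, row_b))


def detect_cuts(low_signatures, cut_threshold):
    """Return indices i where a cut is declared between frame i-1 and frame i."""
    cuts = []
    for i in range(1, len(low_signatures)):
        difference = signature_difference(low_signatures[i - 1], low_signatures[i])
        if difference > cut_threshold:
            cuts.append(i)
    return cuts


# Hypothetical sequence: three dark frames followed by two bright frames.
dark = [[0, 0], [0, 0]]
bright = [[10, 10], [10, 10]]
cuts = detect_cuts([dark, dark, dark, bright, bright], cut_threshold=5)
```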

7. The video segmentation system of Claim 1 wherein said difference 
signal generator applies a weighting function before calculating said difference
signal.

8. The video segmentation system of Claim 2 wherein said frequency 
decomposer generates a high frequency signature including a second set of x by y 
coefficients for each of said plurality of frames. 

9. The video segmentation system of Claim 8 wherein said cut detector 
identifies first and second cut transitions. 

10. The video segmentation system of Claim 9 further comprising:

a fade detector that identifies a fade transition using said high 
frequency signature for a sequence of adjacent frames of said video sequence that 
are located between said first and second cut transitions. 

11. The video segmentation system of Claim 10 wherein said fade
detector further includes:

a linear signal generator that assigns a fade threshold value for each
of said frames that are located between said first and second cut transitions;

a summing signal generator that provides a sum signal for each of
said frames located between said first and second cut transitions by adding said
second set of x by y coefficients in said high frequency signature; and

a comparing circuit connected to said linear signal generator and said
summing signal generator that compares said sum signal with said fade threshold
value for each of said frames located between said first and second cut transitions
and declares a fade transition when said sum signal and said fade threshold are
approximately equal for each of said frames located between said first and second
cut transitions.



12. The video segmentation system of Claim 11 wherein said linear
function signal generator provides a decreasing linear signal to identify a fade out
transition and an increasing linear signal to identify a fade in transition.
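
By way of non-limiting illustration, the comparison of per-frame sum signals against a linear fade threshold can be sketched as follows. The sketch makes several assumptions that are not stated in the claims: the sum signal is taken over coefficient magnitudes, the linear signal is a straight line drawn between the sum signals of the first and last frames of the segment, and "approximately equal" is interpreted as agreement within a relative tolerance. The name `classify_fade` and the parameter `rel_tolerance` are hypothetical.

```python
def classify_fade(high_signatures, rel_tolerance=0.1):
    """Return "fade_out", "fade_in", or None for a segment between two cuts."""
    # Sum signal: total magnitude of the high frequency coefficients per frame.
    sums = [sum(abs(c) for row in sig for c in row) for sig in high_signatures]
    n = len(sums)
    if n < 2 or sums[0] == sums[-1]:
        return None

    # Linear fade threshold: straight line from the first sum to the last sum.
    step = (sums[-1] - sums[0]) / (n - 1)
    thresholds = [sums[0] + step * k for k in range(n)]

    # Declare a fade only if every frame stays close to the linear signal.
    allowed = rel_tolerance * max(sums)
    if all(abs(s - t) <= allowed for s, t in zip(sums, thresholds)):
        return "fade_out" if step < 0 else "fade_in"
    return None


# Hypothetical segment whose high frequency energy decays linearly to zero.
segment = [[[v, v], [v, v]] for v in (4, 3, 2, 1, 0)]
result = classify_fade(segment)
```

A decreasing line corresponds to a fade out and an increasing line to a fade in; a segment whose sum signal stays flat and then drops abruptly does not follow the line and is not classified as a fade.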

13. The video segmentation system of Claim 9 further comprising:

a dissolve detector that identifies a dissolve transition using said low
frequency signature and said high frequency signatures for adjacent frames of said
video sequence located between said first and second cut transitions.

14. The video segmentation system of Claim 13 wherein said dissolve 
detector includes: 

a dissolve segments collector that identifies potential starting and end 
points of said dissolve transitions; and 

a dissolve transition verifier that verifies said potential start and end
points.

15. The video segmentation system of Claim 14 wherein said dissolve
segments collector includes:

a summer that generates a sum signal based on said high frequency
components;

a dissolve generator that generates a dissolve signal;

a difference generator that is connected to said summer and said dissolve
generator and that generates a difference signal based on said sum signal and said
dissolve signal; and

a start and end identifier that is connected to said difference generator and
that identifies said potential start and end points of said dissolve transition.

16. The video segmentation system of Claim 15 wherein said dissolve 
segments collector includes: 

a smoothing filter that is connected between said summer and said 
difference generator. 

17. The video segmentation system of Claim 16 wherein said dissolve 
transition verifier employs a double frame differencing algorithm on said potential
start and end points. 

18. A video retrieval system comprising:

a query image source;

a database containing a plurality of images;

a frequency decomposer connected to said video source and said
query image source that generates low and high frequency signatures for said query
image and said database images; and

an S-distance generator that generates an S-distance measurement for 
each of said database images. 



19. The video retrieval system of claim 18 wherein said S-distance 
generator compares said low and high frequency signals of said query image and 
said database image to generate said S-distance measurement. 

20. The video retrieval system of claim 19 wherein said database images 
are returned in order of highest to lowest similarity based on said S-distance
measurement. 

VIDEO INDEXING AND IMAGE RETRIEVAL SYSTEM
ABSTRACT OF THE DISCLOSURE
A video segmentation system generates an S-distance measurement that is a
representation of the similarity between adjacent frames of a video sequence. The
video segmentation system employs frequency decomposition of a direct current (DC)
luminance signal of a compressed video sequence. High and low frequency
component signatures are generated from a frequency-decomposed signal using
wavelet transformation. A cut detector identifies cut transitions from the low
frequency component signature. A fade detector identifies fade transitions from the high
frequency component signature. A dissolve transition detector employs a double frame
differencing algorithm to identify dissolve transitions. A video retrieval system
likewise generates an S-distance between a query image and a database image. The
video retrieval system employs the low and high frequency component signatures to
generate the S-distance measurement of the similarity between the query image and the
database image. The results of the S-distance measurement allow browsing and
searching of the similar database images.